Project Report

Introduction

The 2019–20 coronavirus pandemic is an ongoing global pandemic of coronavirus disease 2019 (COVID-19). On 11 March 2020, the World Health Organization declared the outbreak a pandemic. As of 4 May 2020, over 3,500,000 cases had been confirmed in more than 120 countries and territories, with major outbreaks in the United States and Europe.

Our project entails a thorough analysis of the available data on COVID-19. Specifically, we organized our analysis around five questions of interest. They are

  1. What are the infection trends of COVID-19 in the United States?

  2. What might be the real number of patients?

  3. What is the best model to predict the number of confirmed cases?

  4. What factors are playing crucial roles in determining whether a country is more likely to have a massive outbreak?

  5. What insights can we gain from previous pandemics to better prepare global responses to COVID-19?

Our data come mainly from Kaggle and the WHO. The significance of such a project goes without saying, as the well-being of millions of people is under threat. In terms of modeling, we use an SIR model to simulate the development of the disease, an SVM model to predict total infection numbers, and Plotly to build interactive infection maps. We also fit a logistic growth model alongside these, with parameter tuning throughout.

In [1]:
# data import and cleaning
import akshare as ak
import numpy as np
import pandas as pd
from pandas import DataFrame
import datetime as dt
import re
import random
import collections
from functools import reduce

# plot
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff

# math and scipy
import math
import scipy.integrate
from scipy.optimize import curve_fit

# sklearn
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Question 1: What are the trends of COVID-19 around the world, and how is the situation in the United States?

a. Analysis of total confirmed cases and deaths from January 22 to April 30, 2020

In [2]:
# get data from GitHub via the akshare API
global_accumulate = ak.covid_19_csse_daily(date="2020-04-30")
global_accumulate.head()
Out[2]:
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
0 45001.0 Abbeville South Carolina US 2020-05-01 02:32:28 34.223334 -82.461707 31 0 0 31 Abbeville, South Carolina, US
1 22001.0 Acadia Louisiana US 2020-05-01 02:32:28 30.295065 -92.414197 130 10 0 120 Acadia, Louisiana, US
2 51001.0 Accomack Virginia US 2020-05-01 02:32:28 37.767072 -75.632346 264 4 0 260 Accomack, Virginia, US
3 16001.0 Ada Idaho US 2020-05-01 02:32:28 43.452658 -116.241552 671 16 0 655 Ada, Idaho, US
4 19001.0 Adair Iowa US 2020-05-01 02:32:28 41.330756 -94.471059 1 0 0 1 Adair, Iowa, US
In [3]:
# drop some unnecessary columns
global_accumulate = global_accumulate.drop(['FIPS', 'Admin2', 'Last_Update', 'Combined_Key', 'Active'] ,axis=1)
global_accumulate = global_accumulate.rename(columns={'Province_State': 'Province/State', 
                                                      'Country_Region': 'Country/Region','Long_': 'Long'})
global_accumulate
Out[3]:
Province/State Country/Region Lat Long Confirmed Deaths Recovered
0 South Carolina US 34.223334 -82.461707 31 0 0
1 Louisiana US 30.295065 -92.414197 130 10 0
2 Virginia US 37.767072 -75.632346 264 4 0
3 Idaho US 43.452658 -116.241552 671 16 0
4 Iowa US 41.330756 -94.471059 1 0 0
... ... ... ... ... ... ... ...
3172 NaN West Bank and Gaza 31.952200 35.233200 344 2 76
3173 NaN Western Sahara 24.215500 -12.885800 6 0 5
3174 NaN Yemen 15.552727 48.516388 6 2 1
3175 NaN Zambia -13.133897 27.849332 106 3 55
3176 NaN Zimbabwe -19.015438 29.154857 40 4 5

3177 rows × 7 columns

In [4]:
global_accumulate.isna().sum()
Out[4]:
Province/State    183
Country/Region      0
Lat                67
Long               67
Confirmed           0
Deaths              0
Recovered           0
dtype: int64
In [5]:
# if the latitude, longitude, or state is unknown, fill it with 'unknown'
global_accumulate.fillna('unknown', inplace=True)
global_accumulate.isna().sum()
Out[5]:
Province/State    0
Country/Region    0
Lat               0
Long              0
Confirmed         0
Deaths            0
Recovered         0
dtype: int64
In [6]:
global_accumulate.duplicated().sum()
Out[6]:
2
In [7]:
global_accumulate.drop_duplicates(inplace=True)
global_accumulate.duplicated().sum()
Out[7]:
0
In [8]:
len(global_accumulate['Country/Region'].unique())
Out[8]:
187
In [9]:
# group by country
global_accumulate_byC = global_accumulate.groupby('Country/Region').agg({'Confirmed': 'sum', 
                                                'Deaths': 'sum',
                                                'Recovered': 'sum'}).sort_values('Confirmed', ascending=False)
global_accumulate_byC.head()
Out[9]:
Confirmed Deaths Recovered
Country/Region
US 1069424 62996 153947
Spain 213435 24543 112050
Italy 205463 27967 75945
United Kingdom 172481 26842 859
France 167299 24410 50380
In [10]:
# Choropleth Map COVID-19 confirmed cases around the world
global_accumulate_map = global_accumulate_byC.reset_index()[['Country/Region', 'Confirmed']]


colors = ["#F9F9F5", "#FAFAE6", "#FCFCCB", "#FCFCAE",  "#FCF1AE", "#FCEA7D", "#FCD97D",
          "#FCCE7D", "#FCC07D", "#FEB562", "#F9A648",  "#F98E48", "#FD8739", "#FE7519",
          "#FE5E19", "#FA520A", "#FA2B0A", "#9B1803",  "#861604", "#651104", "#570303",]


fig = go.Figure(data=go.Choropleth(
    locationmode = "country names",
    locations = global_accumulate_map['Country/Region'],
    z = global_accumulate_map['Confirmed'],
    text = global_accumulate_map['Confirmed'],
    colorscale = colors,
    autocolorscale=False,
    reversescale=False,
    colorbar_title = 'Reported Covid-19 Cases',
))

fig.update_layout(
    title_text='COVID19 Confirmed Cases By Country',
    geo=dict(
        showcoastlines=True,
    ),
)
In [11]:
# get the top ten countries in list for further use
top_10 = global_accumulate_byC.index.tolist()[:10]
In [12]:
# Total confirmed cases in top ten countries 
plt.figure(figsize=(15,5))
sns.set(font_scale=1.5)

sns.barplot(x=global_accumulate_byC.Confirmed[:10],y=global_accumulate_byC.index[:10])

plt.title('Total Confirmed Cases In Top Ten Countries', pad=20, fontsize=20);
In [13]:
# Total death cases in top ten countries
plt.figure(figsize=(15,5))

sns.barplot(x=global_accumulate_byC.sort_values('Deaths', ascending=False).Deaths[:10],y=global_accumulate_byC.sort_values('Deaths', ascending=False).index[:10])

plt.xlabel('Deaths')
plt.title('Total Deaths In Top Ten Countries', pad=20, fontsize=20);
In [14]:
# plot the death rate in top 10 countries
global_accumulate_byC['Death Rate'] = global_accumulate_byC.Deaths/global_accumulate_byC.Confirmed

plt.figure(figsize=(10,4.8))
ax = sns.barplot(x=global_accumulate_byC.sort_values('Death Rate', ascending=False).iloc[:10, 3], y=global_accumulate_byC.sort_values('Death Rate', ascending=False).index[:10])
ax.set_title('Fatality Rate In Top Ten Countries', pad=20, fontsize=20)

# set individual bar labels
for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width()+0.001, i.get_y()+0.5, str(round(i.get_width()*100, 2))+'%', fontsize=12)

From the analysis above, we can draw the following conclusions:

  1. The United States has the most confirmed cases and the most deaths.
  2. The top-ten lists for confirmed cases and for deaths are similar, differing by only two countries.
  3. The fatality-rate ranking looks very different from the confirmed and death rankings. Even though the US has the most confirmed cases and deaths, it does not appear in the top ten by fatality rate.
  4. One reason for the third point may be that medical conditions are much worse in places like Yemen, MS Zaandam, and Nicaragua, so a much larger share of confirmed COVID-19 patients there died.
b. Analysis of daily cumulative confirmed cases

In [15]:
global_confirmed_daily = ak.covid_19_csse_global_confirmed()
global_confirmed_daily.head()
Out[15]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 5/1/20 5/2/20 5/3/20 5/4/20 5/5/20 5/6/20 5/7/20 5/8/20 5/9/20 5/10/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 2335 2469 2704 2894 3224 3392 3563 3778 4033 4402
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 782 789 795 803 820 832 842 850 856 868
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 4154 4295 4474 4648 4838 4997 5182 5369 5558 5723
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 745 747 748 750 751 751 752 752 754 755
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 30 35 35 35 36 36 36 43 43 45

5 rows × 114 columns

In [16]:
# we do not care about the state and the location of the state
global_confirmed_daily = global_confirmed_daily.drop(['Province/State', 'Lat', 'Long'], axis=1)
global_confirmed_daily.head()
Out[16]:
Country/Region 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 ... 5/1/20 5/2/20 5/3/20 5/4/20 5/5/20 5/6/20 5/7/20 5/8/20 5/9/20 5/10/20
0 Afghanistan 0 0 0 0 0 0 0 0 0 ... 2335 2469 2704 2894 3224 3392 3563 3778 4033 4402
1 Albania 0 0 0 0 0 0 0 0 0 ... 782 789 795 803 820 832 842 850 856 868
2 Algeria 0 0 0 0 0 0 0 0 0 ... 4154 4295 4474 4648 4838 4997 5182 5369 5558 5723
3 Andorra 0 0 0 0 0 0 0 0 0 ... 745 747 748 750 751 751 752 752 754 755
4 Angola 0 0 0 0 0 0 0 0 0 ... 30 35 35 35 36 36 36 43 43 45

5 rows × 111 columns

In [17]:
global_confirmed_daily_melt = pd.melt(global_confirmed_daily, id_vars=['Country/Region'])
global_confirmed_daily_melt.rename(columns={'variable': 'Date', 'value': 'Confirmed'}, inplace=True)
In [18]:
global_confirmed_daily_melt['Date'] = pd.to_datetime(global_confirmed_daily_melt['Date'])
global_confirmed_daily_melt
Out[18]:
Country/Region Date Confirmed
0 Afghanistan 2020-01-22 0
1 Albania 2020-01-22 0
2 Algeria 2020-01-22 0
3 Andorra 2020-01-22 0
4 Angola 2020-01-22 0
... ... ... ...
29255 Western Sahara 2020-05-10 6
29256 Sao Tome and Principe 2020-05-10 208
29257 Yemen 2020-05-10 51
29258 Comoros 2020-05-10 11
29259 Tajikistan 2020-05-10 612

29260 rows × 3 columns

In [19]:
# get the top 10 confirmed countries 
daily_confirmed_top10 = global_confirmed_daily_melt[global_confirmed_daily_melt['Country/Region'].apply(lambda c: c in top_10)]

# group by country and date then get sum of confirmed values
daily_confirmed_top10 = daily_confirmed_top10.groupby(['Country/Region','Date'])['Confirmed'].sum().reset_index()
In [20]:
# get the line plot
plt.figure(figsize=(20,10))

axis = sns.lineplot(x='Date', y='Confirmed', hue='Country/Region', data=daily_confirmed_top10)

axis.xaxis.set_major_locator(plt.MaxNLocator(10))
axis.set_xlabel('')
axis.set_title('Daily Confirmed In Top 10 Countries ',fontdict={'fontsize': 25, 'fontweight': 700, 'color': 'maroon'} ,pad=20);

From the graph above, we can draw the following conclusions:

  1. The trends in the other nine countries are similar, all staying around 200,000 cases or below, while the United States has led the graph since the middle of March.
  2. Confirmed cases in the US began to rise rapidly in mid-March 2020.

c. Analysis of daily new cases

China is no longer among the top ten countries, but it was the first country to become aware of this pandemic, so we include it here for comparison.

In [21]:
# read new cases data
new_case_daily = pd.read_csv('new_cases.csv')
new_case_daily
Out[21]:
iso_code location date total_cases new_cases total_deaths new_deaths total_cases_per_million new_cases_per_million total_deaths_per_million new_deaths_per_million total_tests new_tests total_tests_per_thousand new_tests_per_thousand tests_units
0 ABW Aruba 2020-03-13 2 2 0 0 18.733 18.733 0.0 0.0 NaN NaN NaN NaN NaN
1 ABW Aruba 2020-03-20 4 2 0 0 37.465 18.733 0.0 0.0 NaN NaN NaN NaN NaN
2 ABW Aruba 2020-03-24 12 8 0 0 112.395 74.930 0.0 0.0 NaN NaN NaN NaN NaN
3 ABW Aruba 2020-03-25 17 5 0 0 159.227 46.831 0.0 0.0 NaN NaN NaN NaN NaN
4 ABW Aruba 2020-03-26 19 2 0 0 177.959 18.733 0.0 0.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13462 NaN International 2020-02-28 705 0 4 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
13463 NaN International 2020-02-29 705 0 6 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
13464 NaN International 2020-03-01 705 0 6 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
13465 NaN International 2020-03-02 705 0 6 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
13466 NaN International 2020-03-10 696 -9 7 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN

13467 rows × 16 columns

In [22]:
# get the top ten countries by total cases and convert the date column to datetime
top_10_iso = new_case_daily[new_case_daily['location'] != 'World'].groupby('iso_code')['total_cases'].max().sort_values(ascending=False)[:10]
new_case_top10 = new_case_daily[new_case_daily['iso_code'].isin(top_10_iso.index)].copy()
new_case_top10['date'] = pd.to_datetime(new_case_top10['date'])
In [23]:
# plot the daily new cases as a 3-day rolling average, computed per country
# (rolling within each location avoids blending values across country boundaries)
fig = px.line(new_case_top10, x="date",
              y=new_case_top10.groupby('location')['new_cases'].transform(lambda s: s.rolling(3).mean()),
              color='location')

fig.update_layout(title='Daily New Confirmed In Top 10 Countries (rolling 3-day average)',
                   xaxis_title='',
                   yaxis_title='Daily New Cases')
In [24]:
# plot new confirmed cases vs. total confirmed cases on a log-log scale
fig = px.line(new_case_top10, x="total_cases",
              y=new_case_top10.groupby('location')['new_cases'].transform(lambda s: s.rolling(3).mean()),
              color='location', log_x=True, log_y=True)

fig.update_layout(title='Trajectory of World COVID-19 Confirmed Cases (rolling 3-day average)',
                  xaxis_title='Total Confirmed Cases',
                  yaxis_title='New Confirmed Cases (3-day average)')

From the above graph, we can see that:

  1. China was the first country to become aware of COVID-19, and the pandemic there has been brought under control.
  2. The second group is Europe, where most countries' curves have started to decline.
  3. Confirmed cases in the US are still rising rapidly; there is no sign yet of the outbreak coming under control.

d. Analysis of the US situation

In [25]:
# get the US confirmed and deaths data with state code and FIPS
us_confirmed = pd.read_csv('us_confirmed.csv')
us_deaths = pd.read_csv('us_deaths.csv')
In [26]:
# merge the confirmed and deaths dataframes into one US dataframe
us_confirmed = us_confirmed.iloc[:,[0,1,2,3,-1]]
us_deaths = us_deaths.iloc[:,[1,-1]]
us = pd.merge(us_confirmed, us_deaths, left_index=True, right_index=True)
us.rename(columns={'County Name_x': 'County Name','5/1/20_x': 'Confirmed', '5/1/20_y': 'Deaths'}, inplace=True)
us.drop('County Name_y', axis=1, inplace=True)
us
Out[26]:
countyFIPS County Name State stateFIPS Confirmed Deaths
0 0 Statewide Unallocated AL 1 0 0
1 1001 Autauga County AL 1 42 3
2 1003 Baldwin County AL 1 175 4
3 1005 Barbour County AL 1 42 1
4 1007 Bibb County AL 1 42 0
... ... ... ... ... ... ...
3190 56037 Sweetwater County WY 56 11 0
3191 56039 Teton County WY 56 65 1
3192 56041 Uinta County WY 56 6 0
3193 56043 Washakie County WY 56 5 0
3194 56045 Weston County WY 56 0 0

3195 rows × 6 columns

In [27]:
# Group the data by State code
us_confirmed_deaths = us.groupby('State')[['Confirmed', 'Deaths']].sum().reset_index()
us_confirmed_deaths['Text'] = us_confirmed_deaths['State']+ '<br>' + 'Confirmed: '+ \
    us_confirmed_deaths['Confirmed'].astype('str') + '<br>' +'Deaths: ' + us_confirmed_deaths['Deaths'].astype('str')

us_confirmed_deaths.head()
Out[27]:
State Confirmed Deaths Text
0 AK 365 9 AK<br>Confirmed: 365<br>Deaths: 9
1 AL 7294 289 AL<br>Confirmed: 7294<br>Deaths: 289
2 AR 3310 64 AR<br>Confirmed: 3310<br>Deaths: 64
3 AZ 7962 330 AZ<br>Confirmed: 7962<br>Deaths: 330
4 CA 52206 2131 CA<br>Confirmed: 52206<br>Deaths: 2131
In [28]:
# plot the confirmed-case map by state
warnings.filterwarnings("ignore")

colors = ["#F9F9F5", "#FAFAE6", "#FCFCCB", "#FCFCAE",  "#FCF1AE", "#FCEA7D", "#FCD97D",
          "#FCCE7D", "#FCC07D", "#FEB562", "#F9A648",  "#F98E48", "#FD8739", "#FE7519",
          "#FE5E19", "#FA520A", "#FA2B0A", "#9B1803",  "#861604", "#651104", "#570303",]


fig = go.Figure(data=go.Choropleth(
    locationmode = "USA-states",
    locations = us_confirmed_deaths['State'],
    z = us_confirmed_deaths['Confirmed'],
    text = us_confirmed_deaths['Text'],
    colorscale = colors,
    autocolorscale=False,
    reversescale=False,
    colorbar_title = 'Reported Cases',
))

fig.update_layout(
    title_text='Spread Of The COVID-19 Confirmed Cases In The U.S. States',
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, # lakes
        lakecolor='rgb(255, 255, 255)'),
)
In [29]:
# drop the rows whose FIPS are 0 (state-level unallocated counts)
us_county = us[us.countyFIPS != 0].copy()
us_county.head()
Out[29]:
countyFIPS County Name State stateFIPS Confirmed Deaths
1 1001 Autauga County AL 1 42 3
2 1003 Baldwin County AL 1 175 4
3 1005 Barbour County AL 1 42 1
4 1007 Bibb County AL 1 42 0
5 1009 Blount County AL 1 39 0
In [30]:
# plot the map of confirmed cases in each US county
us_county['countyFIPS'] = us_county['countyFIPS'].apply(lambda x: str(x).zfill(5))  # pad FIPS codes to 5 digits

colorscale = ["#f7fbff", "#ebf3fb", "#deebf7", "#d2e3f3", "#c6dbef", "#b3d2e9", "#9ecae1",
    "#85bcdb", "#6baed6", "#57a0ce", "#4292c6", "#3082be", "#2171b5", "#1361a9",
    "#08519c", "#0b4083", "#08306b"
]
endpts = list(np.linspace(1, 100, len(colorscale) - 1))
fips = us_county['countyFIPS'].tolist()
values = us_county['Confirmed'].tolist()


fig = ff.create_choropleth(
    fips=fips, values=values, scope=['usa'],
    binning_endpoints=endpts, colorscale=colorscale,
    show_state_data=False,
    show_hover=True,
    centroid_marker={'opacity': 0},
    asp=2.9, 
    title='Spread Of The COVID-19 Confirmed Cases In The U.S. Counties',
    legend_title='Reported Cases'
)

fig.layout.template = None
fig.show()
In [31]:
# Total confirmed cases in the top 15 states
plt.figure(figsize=(15,5))
sns.set(font_scale=1.2)

ax = sns.barplot(x=us_confirmed_deaths.sort_values('Confirmed', ascending=False).Confirmed[:15],y=us_confirmed_deaths.sort_values('Confirmed', ascending=False).State[:15])

plt.xlabel('Total Confirmed')
plt.title('Total Confirmed Cases In Top Fifteen States In US', pad=20, fontsize=20)

# set individual bar labels
for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width()+0.001, i.get_y()+0.5, int(i.get_width()), fontsize=12);
In [32]:
# Total deaths in the top 15 states
plt.figure(figsize=(15,5))


ax = sns.barplot(x=us_confirmed_deaths.sort_values('Deaths', ascending=False).Deaths[:15],y=us_confirmed_deaths.sort_values('Deaths', ascending=False).State[:15])

plt.xlabel('Total Deaths')
plt.title('Total Deaths In Top Fifteen States In US', pad=20, fontsize=20)

# set individual bar labels
for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width()+0.001, i.get_y()+0.5, int(i.get_width()), fontsize=12);

In the analysis of the US, we can see that:

  1. New York is the state with the most confirmed cases and deaths, far more than any other state.
  2. There is some clustering, for example around New York, California, Washington, and Florida.
  3. The situation is much better in the middle of the country. One reason may be that coastal states like New York and California are international hubs, and their population density is much higher.

Question 2: What might be the real number of patients?

SIR disease model

  • S = S(t) is the number of susceptible individuals
  • I = I(t) is the number of infected individuals
  • R = R(t) is the number of recovered individuals
  • Beta (β): infection rate
  • Gamma (γ): recovery rate
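In equation form, the SIR dynamics summarized above are the standard system of ordinary differential equations (the same derivatives computed in `deriv` below):

```latex
\frac{dS}{dt} = -\beta \frac{S I}{N}, \qquad
\frac{dI}{dt} = \beta \frac{S I}{N} - \gamma I, \qquad
\frac{dR}{dt} = \gamma I
```

where N = S + I + R is the total population, which the model holds constant.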
In [33]:
def func_I(t, beta):
  # early-outbreak exponential approximation: I(t) = e^((beta - gamma) t), with gamma fixed at 1/14
  return math.e ** ((beta - 1 / 14) * t)

# days since the first case (xdata) and the corresponding cumulative confirmed cases (ydata)
xdata = [0, 57, 58, 59, 60, 61]
ydata = [3, 239383, 243365, 245847, 250062, 254462]
 
popt, pcov = curve_fit(func_I, xdata, ydata)
 
beta = popt[0]
gamma = 1 / 14

 
print('beta:', beta, 'gamma:', gamma) 
beta: 0.2797604172855171 gamma: 0.07142857142857142
In [34]:
# Total population, N.
N = 19450000
# Initial number of infected and recovered individuals, I0 and R0.
I0, R0 = 3, 0
# Everyone else, S0, is susceptible to infection initially.
S0 = N - I0 - R0
# A grid of time points (in days)
t = np.linspace(0, 180, 180)

# The SIR model differential equations.
def deriv(y, t, N, beta, gamma):
    S, I, R = y
    dSdt = -beta * S * I / N
    dIdt = beta * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt

# Initial conditions vector
y0 = S0, I0, R0
# Integrate the SIR equations over the time grid, t.
ret = scipy.integrate.odeint(deriv, y0, t, args=(N, beta, gamma))
S, I, R = ret.T

# Plot the data on three separate curves for S(t), I(t) and R(t)
fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot(111, axisbelow=True)
ax.plot(t, S/1000000, 'b', alpha=0.5, lw=2, label='Susceptible')
ax.plot(t, I/1000000, 'r', alpha=0.5, lw=2, label='Infected')
ax.plot(t, R/1000000, 'y', alpha=0.5, lw=2, label='Recovered with immunity')
ax.set_xlabel('Time /days')
ax.set_ylabel('Number (million)')
ax.yaxis.set_tick_params(length=0)
ax.xaxis.set_tick_params(length=0)
ax.grid(b=True, which='major', c='w', lw=2, ls='-')
legend = ax.legend()
legend.get_frame().set_alpha(0.5)
for spine in ('top', 'right', 'bottom', 'left'):
    ax.spines[spine].set_visible(False)

Summary

Based on the fitted infection rate β ≈ 0.28 and recovery rate γ ≈ 0.071, the SIR model predicts that infections in New York State will peak at roughly 7.5 million.
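As a rough cross-check of this peak, the same SIR system can be integrated with a plain forward-Euler loop. This is only a sketch in pure Python, reusing the fitted β and γ from above; the `odeint` integration above remains the primary computation.

```python
# Forward-Euler integration of the SIR model to locate the infection peak.
# Parameters follow the fit above: beta ≈ 0.2798, gamma = 1/14, N = NY state population.
N = 19_450_000
beta, gamma = 0.2797604172855171, 1 / 14
S, I, R = N - 3.0, 3.0, 0.0

dt = 0.05  # step size in days
peak_I, peak_day = I, 0.0
for step in range(int(180 / dt)):
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    S, I, R = S + dS * dt, I + dI * dt, R + dR * dt
    if I > peak_I:
        peak_I, peak_day = I, (step + 1) * dt

print(f"peak infected ≈ {peak_I / 1e6:.1f} million on day {peak_day:.0f}")
```

The Euler result agrees with the `odeint` curve to within the step-size error, peaking at several million infections roughly two months in.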

Question 3: What is the best model to predict the number of confirmed cases?

  • Models: Linear regression, Logistic regression and SVM
  • Confirmed cases in the US
In [35]:
covid_19_data_F = pd.read_csv('covid_19_clean_complete.csv')
time_series_covid_19_confirmed_global_F = pd.read_csv('time_series_covid_19_confirmed.csv')
time_series_covid_19_confirmed_US_F=pd.read_csv('time_series_covid_19_confirmed_US.csv')
In [36]:
# Visualizing the number of confirmed cases in the US by day
list2_F=time_series_covid_19_confirmed_US_F.iloc[:,11:].sum(axis=0)
time_series_covid_19_confirmed_US_F = time_series_covid_19_confirmed_US_F.append(list2_F, ignore_index=True)

time_series_covid_19_confirmed_US_T=time_series_covid_19_confirmed_US_F.T

# column 3253 is the appended national-total row (after transposing)
time_series_covid_19_confirmed_US_T.iloc[11:,3253].dropna().plot()
plt.ylabel('Number of confirmed cases by day')
plt.xlabel('Days from 1/22/20')
plt.title('Number of Confirmed cases in the US');
In [37]:
# Build a new dataframe containing only the daily total of confirmed US cases, for the predictive analysis below.
list2_US=time_series_covid_19_confirmed_US_T[3253].dropna()
Number_of_cases_US=list2_US.to_frame().reset_index()
Number_of_cases_US=Number_of_cases_US.drop('index',axis=1)
Number_of_cases_US=Number_of_cases_US.rename(columns={3253: "number_of_confirmed_cases_US"})
len(Number_of_cases_US)
Out[37]:
76

Build basic linear regression and SVM models to fit and predict the number of confirmed cases

For all predictive models:
We convert the standardized date format to an integer index running from 1 to 76, which makes the later visualizations easier to read. Our data span 01/22 to 04/06, a total of 76 days.

Choose 10 days for future prediction

In [38]:
days_in_future = 10
future_forcast = np.array([i for i in range(len(Number_of_cases_US)+days_in_future)]).reshape(-1, 1)
adjusted_dates = future_forcast[:-10]
In [39]:
start = '1/22/2020'
start_date = dt.datetime.strptime(start, '%m/%d/%Y')
future_forcast_dates = []
for i in range(len(future_forcast)):
    future_forcast_dates.append((start_date + dt.timedelta(days=i)).strftime('%m/%d/%Y'))
In [40]:
# build the day-index feature array (1 to 76); the train/test split below is chronological (shuffle=False)
X_F1 =list(range(1, 77))
X_F1 = np.array(X_F1).reshape(-1, 1)
In [41]:
y_U = Number_of_cases_US['number_of_confirmed_cases_US']
y_U = np.array(Number_of_cases_US['number_of_confirmed_cases_US'])
In [42]:
X_train_confirmed_U, X_test_confirmed_U, y_train_confirmed_U, y_test_confirmed_U = train_test_split(X_F1, y_U, test_size=0.25, shuffle=False) 

Linear Regression model (with degree-6 polynomial features)

In [43]:
poly_U = PolynomialFeatures(degree=6)

poly_X_train_confirmed_U = poly_U.fit_transform(X_train_confirmed_U)
poly_X_test_confirmed_U = poly_U.transform(X_test_confirmed_U)
poly_future_forcast_U = poly_U.transform(future_forcast)

linear_model_U = LinearRegression(normalize=True, fit_intercept=False)
linear_model_U.fit(poly_X_train_confirmed_U, y_train_confirmed_U)
test_linear_pred_U = linear_model_U.predict(poly_X_test_confirmed_U)
linear_pred_U = linear_model_U.predict(poly_future_forcast_U)

print('MAE:', mean_absolute_error(test_linear_pred_U, y_test_confirmed_U))
print('MSE:',mean_squared_error(test_linear_pred_U, y_test_confirmed_U))
MAE: 64323.49842801511
MSE: 6271869698.542065

SVM model

In the SVM model, there are a number of hyperparameters that need to be tuned, such as the kernel ('linear', 'poly', 'sigmoid', and so on), the polynomial degree, gamma, epsilon, C, and shrinking. All of these need to be adjusted to achieve the best fit.
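One way to search these hyperparameters systematically is a small grid search. The sketch below shows the pattern with scikit-learn's `GridSearchCV`, using an illustrative synthetic growth series rather than our actual case counts; the grid values are examples, not the settings we settled on.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# illustrative synthetic growth curve standing in for the confirmed-case series
X = np.arange(60, dtype=float).reshape(-1, 1)
y = 3.0 * np.exp(0.08 * X.ravel())

# candidate values for the SVR hyperparameters discussed above
param_grid = {
    "kernel": ["poly", "rbf"],
    "C": [0.1, 1.0, 10.0],
    "gamma": [0.01, 0.1],
    "epsilon": [0.1, 1.0],
}
search = GridSearchCV(SVR(shrinking=True), param_grid,
                      scoring="neg_mean_absolute_error", cv=3)
search.fit(X, y)
print(search.best_params_)
```

For time-ordered case counts, passing `TimeSeriesSplit` as the `cv` argument would be more appropriate than the default k-fold split, since it never trains on the future.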

In [44]:
svm_confirmed_U = SVR(shrinking=True, kernel='poly',gamma=0.01, epsilon=1,degree=6, C=0.1)
svm_confirmed_U.fit(X_train_confirmed_U, y_train_confirmed_U)
svm_pred_U = svm_confirmed_U.predict(future_forcast)
svm_test_pred_U = svm_confirmed_U.predict(X_test_confirmed_U)

print('MAE:', mean_absolute_error(svm_test_pred_U, y_test_confirmed_U))
print('MSE:',mean_squared_error(svm_test_pred_U, y_test_confirmed_U))
MAE: 136470.06241908937
MSE: 29987727336.379265
In [45]:
# plot the graph of SVM model and Linear Regression
plot1=plt.plot(X_F1,y_U,label="Number of confirmed cases actual in US", color='purple')
plot2=plt.plot(future_forcast,svm_pred_U,label='SVM Model in US', color ='pink',linestyle='dashed')
plot3=plt.plot(future_forcast,linear_pred_U, label='Linear Regression Model in US',color ='green',linestyle='dashed')

plt.xlabel('Days since 1/22/2020')
plt.ylabel('Number of confirmed cases in US')
plt.title('Actual vs. Linear Regression vs. SVM prediction of confirmed cases in US')
plt.legend();

Logistic model

In order to achieve a better fit, we did not only use basic models but also referred to a model specially used for the study of infectious diseases: a logistic growth model. We fit the function implemented below (`logistic_increase_function`) to the existing data, where P(t) is the number of confirmed cases at time t, K is the maximum number of cases (the carrying capacity), and r is the growth rate, which is damped by resistance to the outbreak's growth (e.g. medical isolation). As the visualizations show, this curve fits the data better than the other basic models.
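Concretely, the logistic growth curve being fitted (matching `logistic_increase_function` below) is:

```latex
P(t) = \frac{K \, P_0 \, e^{r (t - t_0)}}{K + P_0 \left( e^{r (t - t_0)} - 1 \right)}
```

where P_0 = P(t_0) is the case count at the reference time t_0, and P(t) approaches K as t grows large.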

In [46]:
def logistic_increase_function(t, K, P0, r):
    # note: r is overridden and fixed at 0.29 here, so curve_fit effectively fits only K and P0
    r=0.29
    t0 = 1
    exp_value = np.exp(r * (t - t0))
    return (K * exp_value * P0) / (K + (exp_value - 1) * P0)
In [47]:
t_F1 = list(range(1, 77))
t_F1 = np.array(t_F1)
In [48]:
P_US = Number_of_cases_US['number_of_confirmed_cases_US']
P_US = np.array(P_US)
popt, pcov = curve_fit(logistic_increase_function, t_F1, P_US)
P_US_predict=logistic_increase_function(t_F1,popt[0],popt[1],popt[2])

print('MAE:', mean_absolute_error(P_US, P_US_predict))
print('MSE:',mean_squared_error(P_US, P_US_predict))
MAE: 2104.8391003950337
MSE: 21155446.989056706
In [49]:
# Visualize the actual trend line and the fitted logistic growth curve
plot1=plt.plot(t_F1,P_US,label="Number of confirmed cases actual")
plot2=plt.plot(t_F1,P_US_predict,label='Logistic growth model')

plt.xlabel('Days since 1/22/2020')
plt.ylabel('Number of confirmed cases')
plt.legend()
plt.title('Actual vs. logistic prediction of confirmed cases in US');

Comparing MAE and MSE of different models (Linear Model, SVM Model and Logistic Model)

By comparing the mean absolute error and mean squared error of the different models, we find that the logistic model matches the actual confirmed-case curve most closely. The main reason is that the logistic model is the closest to the formulations used for infectious diseases. The linear model is the most commonly used and familiar, and its fit is acceptable; since we applied PolynomialFeatures, the polynomial degree remains a parameter to tune.

As for the SVM model, many parameters need to be adjusted, such as the kernel, gamma, and shrinking. With limited data and limited time, the optimal parameters may not have been found, but we think SVM could achieve better results with further tuning.

In [50]:
MAE = {
    'US':[mean_absolute_error(test_linear_pred_U, y_test_confirmed_U),mean_absolute_error(svm_test_pred_U, y_test_confirmed_U),mean_absolute_error(P_US, P_US_predict)]
}
MAE_table = pd.DataFrame(MAE) 
MAE_table.index = ['Linear Model', 'SVM Model', 'Logistic Model'] 
MAE_table
Out[50]:
US
Linear Model 64323.498428
SVM Model 136470.062419
Logistic Model 2104.839100
In [51]:
MSE = {
    'US':[mean_squared_error(test_linear_pred_U, y_test_confirmed_U),mean_squared_error(svm_test_pred_U, y_test_confirmed_U),mean_squared_error(P_US, P_US_predict)]
}
MSE_table = pd.DataFrame(MSE) 
MSE_table.index = ['Linear Model', 'SVM Model', 'Logistic Model'] 
MSE_table
Out[51]:
US
Linear Model 6.271870e+09
SVM Model 2.998773e+10
Logistic Model 2.115545e+07

Summary

By comparing the mean absolute error and mean squared error of the different models, we find that the logistic model matches the actual confirmed-case curve most closely, again because it is the closest to the formulations used for infectious diseases. The linear model's fit is acceptable, with the polynomial degree from PolynomialFeatures as a remaining tuning parameter. The SVM model has many parameters to adjust (kernel, gamma, shrinking, and so on); with limited data and limited time the optimal settings may not have been found, but we think SVM could perform better with further tuning.

Question 4: What factors play crucial roles in determining whether a country is more likely to have a massive outbreak?

In [52]:
# read country codes
country_code=pd.DataFrame(pd.read_csv("wikipedia-iso-country-codes.csv").rename(columns={'English short name lower case':'Country/Territory',"Alpha-3 code":"Code"})[["Country/Territory","Code"]])

covid_data=pd.read_csv("covid_19_clean_complete.csv")
covid=pd.DataFrame(covid_data[covid_data['Date']=='4/23/20'])
covid19 = covid.groupby(by='Country/Region').agg({'Confirmed': 'sum','Deaths': 'sum','Recovered':'sum'}).sort_values(by="Confirmed",ascending=False).reset_index().rename(columns={'Country/Region':'Country/Territory'})
covid19.loc[0,"Country/Territory"]="United States"

# merge confirmed numbers with country names and codes
confirm_num=pd.merge(covid19,country_code,on='Country/Territory',how='left').reset_index(drop=True).sort_values(by="Confirmed",ascending=False)

Variable 1 - Age

As we know, older people are at higher risk from COVID-19. We therefore read in data on the percentage of each country's population aged over 65.
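Note that the `Over_65` column arrives as percentage strings such as `'2.6 %'` (see the table below), so it must be converted to numeric before use in a model. A minimal helper, as a sketch (the notebook's actual cleaning step is not shown here):

```python
def clean_pct(value):
    """Convert a percentage string such as '2.6 %' to a float (2.6)."""
    return float(str(value).rstrip('%').strip())

# applied to the dataframe column, e.g.:
# aging['Over_65'] = aging['Over_65'].map(clean_pct)
print(clean_pct('2.6 %'))
```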

In [53]:
aging=pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_age_structure',header=1)[0][["Country",'age over 65 years[3]']].rename(columns={"Country":"Country/Territory","age over 65 years[3]":"Over_65"})
aging=pd.merge(aging,country_code,on='Country/Territory',how='left').dropna()
aging.head()
Out[53]:
Country/Territory Over_65 Code
0 Afghanistan 2.6 % AFG
1 Albania 13.2 % ALB
2 Algeria 6.2 % DZA
3 Angola 2.4 % AGO
4 Antigua and Barbuda 6.9 % ATG

Variable 2 - The Income Class

For the current 2020 fiscal year, low-income economies are defined as those with a GNI per capita, calculated using the World Bank Atlas method, of $1,025 or less in 2018; lower middle-income economies are those with a GNI per capita between $1,026 and $3,995; upper middle-income economies are those with a GNI per capita between $3,996 and $12,375; and high-income economies are those with a GNI per capita of $12,376 or more.
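These thresholds translate directly into a lookup. A minimal sketch (the function name and group codes are only illustrative; our dataset's `GroupCode` column uses a coarser HIC/MIC/LIC split):

```python
def income_class(gni_per_capita_usd):
    """Map 2018 GNI per capita (Atlas method, USD) to a World Bank income group."""
    if gni_per_capita_usd <= 1025:
        return "LIC"   # low income
    elif gni_per_capita_usd <= 3995:
        return "LMC"   # lower middle income
    elif gni_per_capita_usd <= 12375:
        return "UMC"   # upper middle income
    return "HIC"       # high income

print(income_class(800), income_class(5000), income_class(60000))  # LIC UMC HIC
```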

In [54]:
Country_Class = (pd.read_csv("Income Class.csv", index_col="CountryName")
                   .dropna()
                   .rename_axis("Country/Territory")
                   .rename(columns={"CountryCode": 'Code'})
                   .drop(columns="GroupName"))
Country_Class.head()
Out[54]:
GroupCode Code
Country/Territory
Aruba HIC ABW
Andorra HIC AND
United Arab Emirates HIC ARE
Antigua and Barbuda HIC ATG
Australia HIC AUS

Variable 3 - The Worldwide Governance Indicators (WGI)

Estimate of governance performance, ranging from approximately -2.5 (weak) to 2.5 (strong).

In [55]:
WGI=pd.DataFrame(pd.read_csv("WGI.csv",index_col="Country/Territory")).rename({"Estimate":"WGI"},axis="columns")
WGI.head()
Out[55]:
Code WGI
Country/Territory
Aruba ABW 1.06
Andorra ADO 1.94
Afghanistan AFG -1.46
Angola AGO -1.05
Anguilla AIA 0.83

Variable 4 - Population Density

This is a list of countries and dependent territories ranked by population density, measured by the number of human inhabitants per square kilometer, and also sortable by total area and by population.

In [56]:
pop_dens = (pd.read_csv("API_EN.POP.DNST_DS2_en_csv_v2_988966.csv", index_col='Country Name')
              [['Country Code', '2018']]
              .rename_axis("Country/Territory")
              .rename(columns={'Country Code': "Code", '2018': "Pop_dens"}))
pop_dens.head()
Out[56]:
Code Pop_dens
Country/Territory
Aruba ABW 588.027778
Afghanistan AFG 56.937760
Angola AGO 24.713052
Albania ALB 104.612263
Andorra AND 163.842553

Variable 5 - Barro-Lee Education Level

Average years of tertiary schooling attained.

In [57]:
barro_lee = (pd.read_csv("Education lelvel.csv", index_col='country')
               [["WBcode", "yr_sch_ter"]]
               .rename_axis('Country/Territory')
               .rename(columns={'WBcode': 'Code'}))
barro_lee.head()
Out[57]:
Code yr_sch_ter
Country/Territory
Algeria DZA 0.37
Benin BEN 0.12
Botswana BWA 0.12
Burundi BDI 0.03
Cameroon CMR 0.09

Variable 6 - Health Spending

Health spending measures the final consumption of health care goods and services (i.e. current health expenditure), including personal health care (curative care, rehabilitative care, long-term care, ancillary services, and medical goods) and collective services (prevention and public health services as well as health administration).

In [58]:
health_exp = (pd.read_csv("2.12_Health_systems.csv")
                [["World_Bank_Name", 'Health_exp_per_capita_USD_2016']]
                .rename(columns={"World_Bank_Name": 'Country/Territory',
                                 'Health_exp_per_capita_USD_2016': 'Health_exp_per_capita_USD'}))
health_exp=pd.merge(country_code,health_exp,on='Country/Territory',how='left')

health_exp.sort_values(by='Code').head()
Out[58]:
Country/Territory Code Health_exp_per_capita_USD
12 Aruba ABW NaN
0 Afghanistan AFG 57.2
6 Angola AGO 95.2
7 Anguilla AIA NaN
1 Åland Islands ALA NaN
In [59]:
# Merge Multiple df: https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes/44338256
data_frames = [confirm_num, aging, Country_Class,WGI,pop_dens,barro_lee,health_exp]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Code'], how='left'),
                   data_frames)
In [60]:
df_merge=df_merged.drop(columns=["Country/Territory_x","Country/Territory_y"]).set_index("Country/Territory")
data=df_merge.sort_values(by="Confirmed",ascending=False).dropna()
data.head()
Out[60]:
Confirmed Deaths Recovered Code Over_65 GroupCode WGI Pop_dens yr_sch_ter Health_exp_per_capita_USD
Country/Territory
United States 869170 49954 80203 USA 15.4 % HIC 1.58 35.766089 1.61 9869.7
Spain 213024 22157 89250 ESP 19.4 % HIC 1.00 93.529058 0.78 2389.9
Italy 189973 25549 57576 ITA 23.0 % HIC 0.41 205.450748 0.36 2738.7
France 158303 21889 42762 FRA 19.7 % HIC 1.48 122.338396 0.67 4263.4
Germany 153129 5575 103300 DEU 21.5 % HIC 1.62 237.370970 0.69 4714.3
In [61]:
data=data.reset_index().drop(columns=["Country/Territory","Deaths","Recovered","Code"])
# strip the trailing ' %' (a '-' marks a missing value) and convert to numeric
data["Over_65"] = data["Over_65"].apply(lambda x: np.nan if x == '-' else x[:-1]).astype(float) / 10
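A slightly more defensive way to parse such percentage strings, sketched on a toy Series: `pd.to_numeric` with `errors='coerce'` turns the '-' placeholder into NaN automatically, and `str.rstrip` drops both the '%' and the space before it.

```python
import pandas as pd

# toy values in the same format as the Over_65 column
s = pd.Series(["15.4 %", "2.6 %", "-"])
parsed = pd.to_numeric(s.str.rstrip("% "), errors="coerce")
print(parsed.tolist())  # [15.4, 2.6, nan]
```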
In [62]:
cat_col = data.select_dtypes(exclude=np.number)
cat_col_encoded = pd.get_dummies(cat_col,drop_first=True)
num_cols = data.select_dtypes(include = np.number)
df = pd.concat([num_cols,cat_col_encoded],sort=False,axis=1)

z=df.drop(columns="Confirmed",axis=1)
X=z.values
y = df['Confirmed'].values
In [63]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .1)

lr = LinearRegression()
lr.fit(X_train, y_train)
Out[63]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [64]:
for i, name in enumerate(z.columns):  # pair each feature name with its coefficient
    print(f'{name:>2}: {lr.coef_[i]}')
print()
# the intercept (b in the linear equation) is the estimator's `intercept_` attribute
print(lr.intercept_)
Over_65: 4783.89131661811
WGI: -39811.39931107697
Pop_dens: -0.9487848010457186
yr_sch_ter: 41241.70887319549
Health_exp_per_capita_USD: 33.33805982677821
GroupCode_LIC: 356.7476441382068
GroupCode_MIC: 18285.22274936807

-45693.757551036266
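These numbers fully determine the fitted model: a prediction is just the intercept plus the dot product of the coefficients with a feature row. A quick self-contained check on toy data (the values are illustrative stand-ins, not our country features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy rows in the style of (Over_65, WGI) with confirmed counts as targets
X = np.array([[15.4, 1.58], [19.4, 1.00], [23.0, 0.41]])
y = np.array([869170, 213024, 189973])

lr = LinearRegression().fit(X, y)

# a fitted prediction is intercept_ + coef_ · x
manual = lr.intercept_ + X @ lr.coef_
assert np.allclose(manual, lr.predict(X))
```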
In [65]:
country_group=data.groupby('GroupCode').sum()['Confirmed']

group=country_group.plot(kind='bar',rot=0,figsize=(10,10))
group.set_xlabel('Country Income Group') 
group.set_ylabel('Confirmed')
group.set_title('Confirmed by Income Group', fontdict={'fontsize': 20, 'fontweight': 700, 'color': 'maroon'}, pad=20)
plt.xticks(rotation='horizontal')

for p in group.patches:
    height = p.get_height()
    group.annotate('{}'.format(height),xy=(p.get_x() + p.get_width() / 2, height), ha='center', va='bottom')

As we can tell from this plot, the number of confirmed cases in the high-income class is much higher than in the other two classes; the LIC group has fewer than 4,000 confirmed cases in total. Since COVID-19 is a global outbreak that started in China, in the beginning countries with more multinational enterprises requiring extensive international travel faced higher risks than countries whose economies are less active.

Summary:

The linear regression model is not a good one for predicting each country's confirmed count, but some of its inferences are meaningful for understanding how COVID-19 affects different countries. The following are the factor inferences drawn from our analysis:

  1. Based on the coefficient of WGI, we strongly recommend that organizations include efficiency in dealing with a pandemic as a factor when evaluating a government's performance.
  2. Population density is not that crucial either. We believe that, regardless of population density, quarantine is crucial to controlling COVID-19.
  3. Barro-Lee education, health expenditure, and income class are all indices of how wealthy a country is. In our data, the wealthier a country is, the higher its risk of a COVID-19 outbreak.
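One caveat on reading the coefficients above: the features live on very different scales (percentages, USD, people per km²), so raw coefficient magnitudes are not directly comparable. Standardizing the features first makes each coefficient an effect per standard deviation; a minimal sketch on synthetic data (all numbers here are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# synthetic features on wildly different scales, loosely standing in for
# Over_65 (~tens), Pop_dens (~hundreds), Health_exp (~thousands)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 5000.0])
y = X @ np.array([2.0, 0.05, 0.001]) + rng.normal(size=50)

# fit on standardized features so all coefficients share one scale
X_std = StandardScaler().fit_transform(X)
lr = LinearRegression().fit(X_std, y)
print(np.round(lr.coef_, 2))  # effect per standard deviation of each feature
```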

Question 5: What insights can we obtain from the previous pandemic in better preparing global solutions for COVID-19?

This dataset contains the death tolls of each new pandemic/epidemic of the 21st century. It records the cumulative number of deaths from each disease over a span of 800 days (124 days so far for COVID-19). By examining this dataset, we can compare growth patterns across the epidemics and gain insight into how deadly COVID-19 may ultimately be.

In [66]:
df = pd.read_csv('Disease Data.csv')
df.head()
Out[66]:
NAME Cholera COVID-19 Ebola Swine Flu SARS Measles Meningitis
0 DAY 1 0 0.0 59 0 0 0 0
1 DAY 2 1 0.0 60 0 0 4 0
2 DAY 3 5 0.0 77 0 1 7 2
3 DAY 4 12 0.0 77 0 1 11 3
4 DAY 5 21 0.0 77 0 2 15 6
In [67]:
lines1 = df.plot.line(x='NAME', y=df.columns[1:8], logy = False)

Because the death tolls are on different scales, only Swine Flu is on a similar scale to COVID-19 (for now, at least). We therefore apply a log scale to the dependent variable to see the comparison in better detail. Even on the linear scale, though, we can clearly see how extremely deadly COVID-19 is compared to the other epidemics.

In [68]:
lines = df.plot.line(x='NAME', y=df.columns[1:8], logy = True)

Summary

In this log-scale plot, we can clearly see that COVID-19 was not the fastest killer: it did not surpass all the other epidemics until around day 50. This finding agrees with our knowledge that COVID-19 has a relatively long incubation period compared to other viruses. We can also see a slowdown in its death toll at around day 50, which matches COVID-19's successful containment in China in late February, when it was not yet causing too big a problem globally.
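The "around day 50" crossover can be computed rather than read off the plot. A minimal sketch on a toy frame shaped like `Disease Data.csv` (the numbers below are made up; running the same logic on the real file would give the actual crossover day):

```python
import pandas as pd

# toy cumulative death tolls in the same layout as Disease Data.csv
toy = pd.DataFrame({
    "NAME": [f"DAY {i}" for i in range(1, 6)],
    "COVID-19": [0, 10, 200, 5000, 90000],
    "Ebola":    [59, 300, 2000, 8000, 11000],
    "SARS":     [0, 50, 400, 700, 774],
})

others = toy.drop(columns=["NAME", "COVID-19"])
surpassed = toy["COVID-19"] > others.max(axis=1)  # rows where COVID-19 leads all
first_day = toy.loc[surpassed, "NAME"].iloc[0] if surpassed.any() else None
print(first_day)  # DAY 5 on this toy data
```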

Conclusion

After conducting all the analysis over the data and answering the business questions we posed, we can now summarize our findings and validate them. We found that different countries have their own growth patterns owing to the strictness of their regulations and quarantine policies. By comparing the models' performance, we see that the logistic model generally fits the data most accurately, out-performing the SVM and linear models in predicting total confirmed-case trends. This is validated by China's case pattern, as the pandemic is near its end in China. Using the SIR model, we also found that the real infected number is at least twice the reported figure, which is supported by Governor Cuomo's assertion that New York's actual cases are much higher than reported: his team randomly tested 30,000 people for antibodies at New York supermarkets, and 13.4% of them tested positive. If we apply this rate to other areas, we can conclude that the real case count is at least two to three times the reported one.

It is almost certain that COVID-19 is the most severe pandemic humanity has faced in at least a century. Its high transmissibility and relatively high lethality have put global healthcare services under huge pressure. Under the current circumstances, every one of us should keep practicing social distancing to help the healthcare system. Fortunately, we are seeing another slowdown in the log-scaled graph above, suggesting that the number of deaths has stopped accelerating, though it remains stable at a frighteningly high level. We hope our small study can contribute to the global effort in fighting COVID-19.